Module 1: Collecting, Summarizing, and Visualizing Data

Video 1.1 Prior Readings

These videos reference the following articles:

  1. Los Angeles Times 2008 article Study finds hospitals slow to defibrillate

  2. Gallup 2018 article Americans Hit the Brakes on Self-Driving Cars

Video 1.1 Learning Outcomes

  1. Explain why data collected in a sample are or are not representative of a larger population.
  2. Describe how convenience samples, voluntary response, and nonresponse lead to samples that are not representative of a larger population.
  3. Identify confounding variables in a study.
  4. Distinguish between an observational study and experiment, and determine when it is appropriate to infer that a causal relationship exists.
  5. Define randomization, control, replication, and blocking, and explain why these are important in the design of an experiment.
  6. Explain the placebo effect and the importance of blind and double-blind experiments.
  7. Analyze ethical concerns associated with studies.

Designing An Observational Study

  • In order to generalize results from a sample to a larger population, the sample must be chosen in a way that is representative of the population.

  • Sampling bias occurs when certain individuals or groups are more likely to be included in a study than others.
    -Ex: sampling only engineering students about self-driving cars

  • Voluntary response and non-response bias occur when only a small percentage of people selected in a sample respond. Those who respond might be systematically different than those who do not.
    -Ex: respondents might have stronger opinions (or more time on their hands) than others

  • Researchers should randomly select participants and follow up using multiple methods to reach as many individuals as possible

Confounding Variable

A confounding variable is a variable related to both the explanatory and response variable, so that its effects cannot be separated from the effects of the explanatory variable.

Observational Studies and Experiments

  • An observational study is a study in which researchers observe individuals and measure variables of interest but do not intervene in order to attempt to influence responses

  • An experiment is a study in which experimental units are randomly assigned to two or more treatment conditions and the explanatory variable is actively imposed on the subjects

In an observational study, we can never conclude that one variable causes a change in the other due to the possibility of confounding variables

In an experiment, we may conclude that one variable causes a change in the other since we have controlled for confounding variables

Principles of Good Experiments

  1. Control: Researchers assign subjects to different treatments to control for differences between groups.

  2. Randomization: Subjects are randomly assigned to groups so that there are no systematic differences between groups, which could introduce confounding factors.

  3. Replication: The more subjects we study, the more precisely we can estimate the effects being studied.

  4. Blocking: Some experiments group subjects with similar characteristics, such as good health or poor health, before assigning treatments. This ensures that each treatment group contains a similar mix of good-health and poor-health subjects.

  5. A placebo is a fake treatment given to account for the possibility of subjects experiencing an effect simply from believing they received a treatment.

  6. A double-blind experiment is one in which neither the subjects nor the people administering the treatment know whether the subject received the treatment or a placebo.

Inferring Causation and Generalizing Results

  • We can only generalize from sample to population when a sample is randomly selected.

  • We can only infer causation when using a randomized experiment.

                               Treatments Randomly Assigned          Treatments Not Randomly Assigned
Sample Randomly Collected      Can infer causation and generalize    Can generalize results
                               to population
Sample Not Randomly Collected  Can infer causation                   Cannot infer causation or generalize results

Video 1.2 Learning Outcomes

  1. Distinguish between categorical and quantitative variables.
  2. Interpret results displayed in bar graphs, histograms, boxplots, and scatterplots.
  3. Describe the center, shape, variability, and outliers in a distribution.

2018 Movies

We will look at a dataset with information on 272 movies released in 2018, which was obtained from https://www.imdb.com/.

We have information on each film’s

  • IMDB score (score of 1-10 by users of IMDB.com)
  • MPAA Rating (PG, PG-13, R, Not Rated)
  • Genre (Action, Comedy, Drama, etc.)
  • Runtime (in minutes)
  • Revenue (worldwide revenue in millions)

Biggest Moneymakers

Movies that generated more than $250 million in revenue:

##                            Title IMDB Rating     Genre Runtime Revenue
## 1                  Black Panther  7.3  PG-13    Action     134  700.06
## 2         Avengers: Infinity War  8.5  PG-13    Action     149  678.82
## 3                  Incredibles 2  7.7     PG Animation     118  608.58
## 4 Jurassic World: Fallen Kingdom  6.2  PG-13    Action     128  417.72
## 5                        Aquaman  7.2  PG-13    Action     143  334.14
## 6                     Deadpool 2  7.8      R    Comedy     119  324.59
## 7                     The Grinch  6.3     PG    Family      86  270.60

Observational Units and Variables

The rows of the datasets are called observational units.
-Films are the observational units in this dataset.

The columns of the datasets are called variables.

A quantitative variable is one that takes on numeric values.
- Examples: IMDB, Runtime, Revenue

A categorical variable is one where outcomes are a set of categories
-Examples: Rating, Genre

Movie Genres

Bar graphs are used to display frequencies for categorical variables.
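A minimal base-R sketch of the idea, using a hypothetical handful of genre labels (the slides' actual plots are built from the full movies dataset):

```r
# Hypothetical genre labels; table() counts each category and
# barplot() displays those frequencies as a bar graph.
genre <- c("Action", "Action", "Comedy", "Drama", "Drama", "Drama")
counts <- table(genre)
barplot(counts, xlab = "Genre", ylab = "Frequency")
```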

Genre and Rating

Stacked bar graphs display information on 2 categorical variables such as Genre and Rating

IMDB Scores

Histograms and boxplots are used to display quantitative variables.

In a histogram, the x-axis contains numbers, rather than categories.

  • Movies most often got scores between 6 and 7.5.
  • The middle 50% of movies scored between 5.7 and 7.1 (region in box).
  • A few movies scored much lower than the rest (outliers).

Comparing IMDB Score and Runtime

Scatterplots display the relationship between two quantitative variables.

  • Longer movies tend to get higher scores, on average.

Comparing Revenue and IMDB Score

  • There does not appear to be a relationship between Revenue and IMDB score

Video 1.3 Learning Outcomes

  1. Explain what mean, median, interquartile range, and standard deviation tell us about a dataset.
  2. Explain the impact of outliers on the mean and median of a dataset.
  3. Compare measures of center and spread between different datasets.
  4. Assess the amount of variability in data displayed graphically, or described in words.

Describing Center

median(movies$IMDB)
## [1] 6.5
mean(movies$IMDB)
## [1] 6.408088

In a roughly symmetric dataset, the mean and median are approximately the same.

Movie Revenues

This distribution is said to be right-skewed. Most movies made less than $50 million, but a few made much more, creating a “tail,” or long “whisker” going to the right.

Mean and Median Movie Revenues

median(movies$Revenue)
## [1] 6.605
mean(movies$Revenue)
## [1] 41.55228

In a right-skewed distribution, the mean is considerably larger than the median, since a few very large observations pull the mean up considerably, but don’t change the middle number in the dataset. In these situations, the median is usually a better indicator of a “typical” value.
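A tiny hypothetical example (revenues in millions) shows the pull of one large value:

```r
# Hypothetical right-skewed revenues: one blockbuster dwarfs the rest
rev <- c(2, 3, 5, 6, 8, 10, 700)
median(rev)   # the middle value is unaffected by how extreme the blockbuster is
mean(rev)     # pulled far above the median by the single large value
```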

Comparison of Dramas and Comedies

  1. On average, dramas score higher than comedies.
  2. There is more variability in scores for comedies than dramas.

Statistics to Describe Center and Variability

In addition to graphics, we can use statistics to describe the amount of variability in a dataset.

Common Measures of Center:

  • Mean
  • Median

Common Measures of Variability:

  • Interquartile range (IQR) - range of middle 50% of data, i.e. width of the box
  • Standard deviation - roughly the average difference between individual observations and the mean

The higher the IQR or standard deviation, the more variability in the data.
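With a small hypothetical set of scores, each measure is a single function call in R:

```r
# Hypothetical scores; IQR() and sd() compute the two measures of variability
scores <- c(4.1, 5.6, 6.0, 6.3, 6.5, 6.8, 7.0, 7.4, 8.2)
IQR(scores)   # width of the middle 50% of the data
sd(scores)    # roughly the typical distance of observations from the mean
```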

Summary Statistics for Comedy and Drama IMDB Scores

Genre    Mean       Median   IQR    StDev
Comedy   6.383721   6.5      1.15   0.9928147
Drama    6.731429   6.8      0.70   0.8112341
  • Dramas tend to score higher on average.
  • There is more variability in scores for comedies than for dramas.

Module 2: Inference for Categorical Data

Section 2.1 Learning Outcomes

  1. State the parameter of interest in a study.
  2. State appropriate null and alternative hypotheses, using both words and symbols.
  3. Identify sample statistics.

Can Dolphins Communicate?

Example from Chapter 1 of Introduction to Statistical Investigations by Tintle et al. 

Image from http://www.bbc.com/future/story/20130613-decoding-the-language-of-dolphins

  • Pioneering research in marine biology (1960s) studied whether dolphins could communicate with each other beyond relaying simple feelings.

  • Dr. Jarvis Bastian conducted an experiment involving two dolphins, Doris and Buzz.

Doris and Buzz Study

Image from Introduction to Statistical Investigations by Tintle et al.

  • Two buttons and a light were placed underwater. Dolphins trained to push right button when light shone steadily and left button when light blinked.

  • After mastering task, dolphins were separated by curtain placed through the center of the pool. Only Doris could see light, and only Buzz could push button, but they could hear each other’s sounds.

  • After seeing light, Doris would whistle to Buzz, who would press button. If Buzz pushed correct button, both dolphins were rewarded with fish.

Doris and Buzz Study

  • In 16 attempts, the dolphins got the button right 15 times.

Samples, Statistics, and Parameters

  • A sample is a subset of individuals or outcomes of interest.
    • Example: Set of 16 attempts by Doris and Buzz
  • A statistic is a numerical summary of the sample.
    • Example: The proportion of attempts that Doris and Buzz got right.
    • When the statistic is a proportion, we denote it using the symbol \(\hat{p}\), so here \(\hat{p}=\frac{15}{16}=0.9375.\)
  • A parameter is the long-run numerical property of the process or population.
    • Example: The proportion of times Buzz would choose the right button if the experiment were repeated many, many times.
    • When the parameter is a proportion, we denote it using the letter \(p\).

Possible Hypotheses

Does this result provide evidence that the dolphins were actually communicating effectively?

Hypothesis 1:
Buzz was just guessing which button to push.

Hypothesis 2:
Buzz was not just guessing, and was using information from Doris (or possibly another source).

Hypotheses

Does this result provide evidence that the dolphins were actually communicating effectively?

Hypothesis 1: (Null Hypothesis)
Buzz was just guessing which button to push. (\(p=0.5\))

Hypothesis 2: (Alternative Hypothesis)
Buzz was not just guessing, and was using information from Doris (or possibly another source). (\(p>0.5\))

  • The null hypothesis is the “by chance alone” explanation.

  • The alternative hypothesis is another explanation that contradicts the null hypothesis.

Key Question

How likely is it that the dolphins would have gotten 15 or more attempts correct out of 16 if they were just guessing?

  • In general, we need to determine the probability of getting a result as extreme as or more extreme than we did (i.e., 15 out of 16 correct) if the null hypothesis is true (that is, if the dolphins are randomly guessing).

  • We’ll do this by simulating a situation where the null hypothesis is true (a coin flip)

Section 2.2 Learning Outcomes

  1. Describe how to use simulation to test hypotheses.
  2. Determine whether there is evidence to reject a null hypothesis.
  3. Explain the meaning of a p-value in a given context.

Simulating Coin Flips

The following R code will simulate flipping a coin 16 times.

set.seed(09192018)
Flips <- sample(c("H", "T"), prob= c(0.5, 0.5), size=16, replace=TRUE)  
Flips
##  [1] "T" "T" "H" "T" "T" "H" "T" "H" "H" "T" "T" "T" "H" "T" "H" "H"

Number of heads:

sum(Flips == "H")
## [1] 7

10,000 Sets of 16 Coin Flips

Now, we’ll repeat simulating 16 flips 10,000 times, and keep track of the number of heads.

set.seed(09192018)
Heads <- rep(NA, 10000)  # storage for the number of heads in each set of 16 flips
for(i in 1:10000){
  Flips <- sample(c("H", "T"), prob= c(0.5, 0.5), size=16, replace=TRUE)  
  Heads[i] <- sum(Flips == "H")
}
Results <- data.frame(Heads)

Histogram of Number of Heads

SimDolphins <- gf_histogram(~Heads, data=Results, 
             bins=17, binwidth = 1, 
             border=0, fill="blue", color="black") + 
  geom_vline(xintercept=15, colour="red")
SimDolphins

In 10,000 simulations how many times did we get 15 or more heads?

sum(Heads >= 15)
## [1] 3

Conclusion

The probability of the dolphins getting 15 or more of the signals correct in 16 flips is approximately \(\frac{3}{10,000}=0.0003\).

There is strong evidence that the dolphins are not just guessing, and may indeed be communicating.

p-value

The p-value is the probability of obtaining a result as or more extreme than we did, when the null hypothesis is true.

Our simulated p-value is \(\frac{3}{10,000}=0.0003\).

The probability of the dolphins getting 15 or more attempts correct if they are just randomly guessing is approximately 0.0003.

p-values and Conclusion

Small p-values provide evidence against the null hypothesis!

Often, we reject the null hypothesis if the p-value is less than 0.05 (0.1 and 0.01 are also reasonable criteria).

Dolphins Example:

Since the p-value is very small, it is very unlikely that the dolphins would have gotten 15 or more correct if they were randomly guessing.

We reject the null hypothesis. There is strong evidence that the dolphins were not purely guessing.

What if…

What if Buzz had been right on only 10 of the 16 tries? Would this change our conclusion?

Histogram of Results

In 10,000 simulations how many times did we get 10 or more heads?

sum(Heads >= 10)
## [1] 2299

Conclusion for 10 out of 16

The probability of the dolphins getting 10 or more of the signals correct in 16 flips is approximately \(\frac{2,299}{10,000}=0.2299\).

It is plausible that the dolphins would have gotten 10 or more correct by randomly guessing, so in this situation, we would not have evidence that the dolphins are actually communicating. We would not reject the null hypothesis.

We will never accept or prove the null hypothesis. We can only evaluate the strength of the evidence against it.

p, \(\hat{p}\), and p-value

To review,

  • \(p\) represents the unknown “true” probability of a “success” occurring (parameter).
  • \(\hat{p}\) represents the proportion of successes observed in the sample (statistic).
  • the p-value is the probability of observing as many or more successes as we did if the null hypothesis is true. (Note: sometimes the p-value will represent as many or fewer successes, or a number of successes as extreme as we got)

Final Note About Doris and Buzz

In order to make general claims about dolphin communication, this (or a similar) experiment would need to be conducted on dolphins other than just Buzz and Doris. Subsequent research has provided evidence that dolphins are highly intelligent mammals, capable of high-level communication.

National Geographic Article

Section 2.3 Learning Outcomes

  1. State null and alternative hypotheses in tests involving two proportions.
  2. Explain how to use simulation to test hypotheses involving two proportions.
  3. Interpret results of a hypothesis test for a difference between population proportions for two groups.

Trust of Babies

A 2011 study by Wood, titled “Babies Learn Early Who They Can Trust” examined the way that babies responded to adults who either correctly or falsely led them to believe there was something exciting in a box. (Data from Statistics: Unlocking the Power of Data by Lock et al.)

  • 60 babies (age 13-16 months) were divided into 2 groups of 30
  • each baby watched an adult look into a box and become very excited
  • the babies were then shown the box
    - 30 boxes were empty (adult had deceived the baby)
    - 30 contained toys (adult was trustworthy)
  • the adult then pushed on a light with their forehead
  • researchers recorded how many babies imitated the adult’s behavior

Trust of Babies Results

                                       Imitated    Did not imitate   Total
Box Contained Toy (adult trustworthy)  18 (0.60)   12 (0.40)         30
No Toy (adult not trustworthy)         10 (0.33)   20 (0.67)         30

Do you think these results provide evidence of a difference in the way the babies respond?

Possible Explanations

Explanation 1:

Whether or not a baby presses the light has nothing to do with the adult’s behavior. More of the babies who were inclined to press the button happened to be assigned to the group with “trustworthy adults,” by pure chance.

Explanation 2:

The observed difference between the groups is due to more than just pure chance. (Perhaps explained by the behavior or “trustworthiness” of the adult.)

Hypotheses

Define:

\(p_1\): proportion of all babies who would imitate a “trustworthy” adult.

\(p_2\): proportion of all babies who would imitate a “non-trustworthy” adult.

Null Hypothesis: There is no difference between the proportion of all babies who would imitate “trustworthy” and “non-trustworthy” adults (\(p_1=p_2\) or \(p_1-p_2=0\))

Alternative Hypothesis: Babies are more likely to imitate “trustworthy” adults than non-trustworthy ones. (\(p_1>p_2\) or \(p_1-p_2>0\))

Sample Statistics and Key Question

Sample statistics: \(\hat{p}_1=\frac{18}{30}=0.6\), \(\hat{p}_2=\frac{10}{30}\approx0.3333\)

\(\hat{p}_1-\hat{p}_2\approx0.2667\)

How likely is it that we would get a difference in proportions as large as 0.2667 if there is really no difference in the way babies respond to “trustworthy” and “non-trustworthy” adults?

How might we simulate a situation where there is really no difference in the way the babies respond?
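One approach, sketched here in base R: if the adult's behavior makes no difference, the group labels are arbitrary, so we can pool all 60 outcomes, shuffle them into two groups of 30 many times, and see how often the shuffled difference in proportions is as large as the observed 0.2667. (This is a sketch of the idea; it is not necessarily the exact code behind the output that follows.)

```r
# Randomization (shuffling) test for the difference in proportions.
# Under the null hypothesis, which babies imitated has nothing to do
# with group, so we repeatedly re-deal the 60 outcomes into two groups.
set.seed(1)
outcomes <- c(rep(1, 28), rep(0, 32))   # 28 imitators among all 60 babies
obs_diff <- 18/30 - 10/30               # observed difference (about 0.2667)

diffs <- replicate(10000, {
  shuffled <- sample(outcomes)          # random re-assignment to groups
  mean(shuffled[1:30]) - mean(shuffled[31:60])
})
p_value <- mean(diffs >= obs_diff)      # one-sided simulation-based p-value
```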

Simulation-Based Hypothesis Test

## [[1]]

## 
## $`Observed Difference in Proportions`
## [1] 0.2666667
## 
## $`Simulation-based p-value`
## [1] 0.025

Conclusion

The p-value represents the probability of getting a difference in sample proportions as large as 0.2667 if there is really no difference between the proportion of all babies who would imitate “trustworthy” and “non-trustworthy” adults.

Since our observed difference is extreme and the p-value is low (0.025), our results are not consistent with the hypothesis that there is no difference. The data provide evidence that the difference between the groups is not due to chance alone.

The extent to which we attribute this to the “trustworthiness” of the adult depends on the nature of the experiment. Since “trustworthiness” was established in a very specific way, we should be careful not to overgeneralize.

Video 2.5 Learning Outcomes

  1. Explain the impact of sample size and difference between observed statistic and hypothesized value of parameter on p-value.

Evidence and p-values

The strength of evidence against a null hypothesis depends on both:

  1. the difference between the observed statistic \(\hat{p}\) and the hypothesized value \(p\).

  2. the sample size.

Strength of Evidence Example

Null Hypothesis: For all AP test questions, probability that B is correct is 1/5 = 0.2 (\(p=0.2\))

Alternative Hypothesis: For all AP test questions, probability that B is correct is greater than 0.2 (\(p>0.2\))

Sample Size and Size of Difference

  • As the difference between the observed statistic and the hypothesized value increases, the evidence against the null hypothesis gets stronger and the p-value gets smaller.

  • As the sample size increases (for the same observed difference), the evidence against the null hypothesis gets stronger and the p-value gets smaller.
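A quick illustration with prop.test, using hypothetical counts chosen so the sample proportion is 0.3 in both cases; only the sample size changes:

```r
# Same observed proportion (0.3) tested against the hypothesized p = 0.2,
# at two different sample sizes.
small <- prop.test(x = 15,  n = 50,  p = 0.2, alternative = "greater", correct = FALSE)
large <- prop.test(x = 150, n = 500, p = 0.2, alternative = "greater", correct = FALSE)
small$p.value   # weaker evidence from the smaller sample
large$p.value   # far stronger evidence from the larger sample
```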

Section 2.4 Learning Outcomes

  1. Calculate a standardized statistic (z-score) for categorical data.
  2. Explain what a z-score tells us about the plausibility of the null hypothesis.

Shape of Sampling Distribution for \(\hat{p}\)

What do you notice about the shape of the sampling distributions we’ve seen so far?

Normal Distribution

In many situations, the sampling distribution for a proportion (or a difference in proportions) can be approximated by a symmetric, bell-shaped curve, known as a normal distribution.

Image from Intro Statistics with Simulation and Randomization by Deitz, Barr, Cetinkaya-Rundel

Standardized Score (z-score)

A standardized score, or z-score, is useful for comparing our sample statistic \(\hat{p}\) to the hypothesized value \(p\).

A z-score tells us how many standard deviations a statistic lies away from its hypothesized value.

\(z=\frac{\text{Statistic}-\text{Hypothesized Value}}{\text{Standard Deviation of Statistic}}\)

The standard deviation of a sample statistic is sometimes called the standard error.

z-scores as Evidence

z-scores more extreme than \(\pm 2\) typically provide evidence against the null hypothesis.

z-score for Single Proportion

\(z=\frac{\text{Statistic}-\text{Hypothesized Value}}{\text{Standard Error}}\)

\(\text{Standard Error} = \sqrt{\frac{p(1-p)}{n}}\)

Where \(p\) is the hypothesized value, and \(n\) is sample size.

Thus,

\(z=\frac{\hat{p}-p}{\sqrt{\frac{p(1-p)}{n}}}\)

Self-Driving Cars Example

Null Hyp: The proportion of all US adults who favor riding in self-driving cars is 0.25 (\(p=0.25\)).

Alt. Hyp: The proportion of all US adults who favor riding in self-driving cars is different from 0.25 (\(p \neq 0.25\)).

\(\hat{p}=\frac{758}{3297}\approx0.2299\), and \(n=3297\)

\(z=\frac{0.2299-0.25}{\sqrt{\frac{0.25(1-0.25)}{3297}}}\approx-2.66\)
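The calculation can be checked in base R (pnorm gives the lower-tail probability used as the p-value):

```r
# z-score for the self-driving car sample
p_hat <- 758 / 3297                 # observed sample proportion
p0    <- 0.25                       # hypothesized value
n     <- 3297
se    <- sqrt(p0 * (1 - p0) / n)    # standard error under the null hypothesis
z     <- (p_hat - p0) / se
p_val <- pnorm(z)                   # lower-tail probability
```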

Plotting Standardized Score

The sample statistic we observed is 2.6 standard errors lower than we would have expected if the null hypothesis is true. This provides evidence against the null hypothesis.

Calculating p-value

p-value:

## [1] 0.003854433

Hypothesis Tests in R

The prop.test command in R can be used to get the p-value from the test.

  • x = number of “successes”
  • n = sample size
  • p = hypothesized value of the parameter in the null hypothesis
  • alternative = “less”, “greater”, or “two.sided”, matching the alternative hypothesis

prop.test(x=758, n=3297, p=0.25, alternative="less", conf.level = 0.95, correct=FALSE)
## 
##  1-sample proportions test without continuity correction
## 
## data:  758 out of 3297, null probability 0.25
## X-squared = 7.0999, df = 1, p-value = 0.003854
## alternative hypothesis: true p is less than 0.25
## 95 percent confidence interval:
##  0.0000000 0.2421781
## sample estimates:
##        p 
## 0.229906

Note that the absolute value of the z-statistic we calculated is the square root of the X-squared value in the output.

Simulation-based p-value

## [[1]]

## 
## $`Observed Proportion`
## [1] 0.229906
## 
## $`Simulation-based p-value`
## [1] 0.0036

Confidence Intervals for Proportions

A confidence interval tells us a range in which a parameter could reasonably lie.

An approximate 95% confidence interval is given by

\(\text{Statistic} \pm 2 \times \text{Standard Error}\)

An approximate 95% confidence interval for a proportion \(p\) is:

\(\hat{p} \pm 2 \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\)

Confidence Interval for Cars Example

A 95% confidence interval for the proportion of all US adults who would likely ride in a self-driving car is:

\(0.23 \pm 2\times \sqrt{\frac{0.23(1-0.23)}{3297}} = 0.23 \pm 0.015.\)

\(0.23-0.015=0.215\) and \(0.23+0.015=0.245\)
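The same interval can be computed in base R; using the exact sample proportion rather than the rounded 0.23 gives essentially the prop.test interval shown on the next slide:

```r
# Approximate 95% CI for a proportion: statistic ± 2 × standard error
p_hat <- 758 / 3297
n     <- 3297
se    <- sqrt(p_hat * (1 - p_hat) / n)   # standard error uses p-hat for a CI
c(lower = p_hat - 2 * se, upper = p_hat + 2 * se)
```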

Confidence Intervals Directly in R

We can calculate confidence interval directly using the prop.test() function in R.

Make sure to set alternative="two.sided" when making a confidence interval.

prop.test(x=758, n=3297, conf.level = 0.95,  alternative="two.sided", correct=FALSE)
## 
##  1-sample proportions test without continuity correction
## 
## data:  758 out of 3297, null probability 0.5
## X-squared = 962.07, df = 1, p-value < 0.00000000000000022
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.2158625 0.2445781
## sample estimates:
##        p 
## 0.229906

We are 95% confident that the proportion of all US adults who would want to ride in a self-driving car is between 0.216 and 0.245.

Section 2.5 Learning Outcomes

  1. Interpret hypothesis tests for proportions using a theory-based normal approximation.
  2. Determine whether it is appropriate to use a normal distribution to approximate the sampling distribution for a proportion.

Babies Example

Null Hypothesis: There is no difference between the proportion of all babies who would imitate “trustworthy” and “non-trustworthy” adults (\(p_1=p_2\) or \(p_1-p_2=0\))

Alternative Hypothesis: Babies are more likely to imitate “trustworthy” adults than non-trustworthy ones. (\(p_1>p_2\) or \(p_1-p_2>0\))

Sample statistics: \(\hat{p}_1=\frac{18}{30}=0.6\), \(\hat{p}_2=\frac{10}{30}\approx0.33\)

z-score for Difference in Two Proportions

\(z=\frac{\text{Statistic}-\text{Hypothesized Value}}{\text{Standard Error}}\)

\(z=\frac{(\hat{p}_1-\hat{p}_2)-0}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}}\)

Where \(\hat{p}\) is the overall proportion of successes when the two groups are combined, and \(n_1\) and \(n_2\) are the group sample sizes.

Babies example:

\(z=\frac{\frac{18}{30}-\frac{10}{30}}{\sqrt{\left (\frac{18+10}{30+30}\right)\left(1-\left (\frac{18+10}{30+30}\right)\right)\left(\frac{1}{30}+\frac{1}{30}\right)}}\approx2.07\)
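The arithmetic checks out in base R:

```r
# Pooled two-proportion z-score for the babies study
x1 <- 18; n1 <- 30
x2 <- 10; n2 <- 30
p_pool <- (x1 + x2) / (n1 + n2)         # overall proportion of imitators
se <- sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
z  <- (x1/n1 - x2/n2) / se
p_val <- pnorm(z, lower.tail = FALSE)   # upper-tail p-value
```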

Babies Standardized Score

The sample statistic we observed is 2.07 standard errors larger than we would have expected if the null hypothesis is true. This provides evidence against the null hypothesis.

Babies p-value

p-value:

## [1] 0.01921695

Babies Hypothesis Test in R

prop.test(x=c(18, 10), n=c(30,30), alternative="greater", correct=FALSE)
## 
##  2-sample test for equality of proportions without continuity correction
## 
## data:  c(18, 10) out of c(30, 30)
## X-squared = 4.2857, df = 1, p-value = 0.01922
## alternative hypothesis: greater
## 95 percent confidence interval:
##  0.06249661 1.00000000
## sample estimates:
##    prop 1    prop 2 
## 0.6000000 0.3333333

Simulation-based p-value

## [[1]]

## 
## $`Observed Difference in Proportions`
## [1] 0.2666667
## 
## $`Simulation-based p-value`
## [1] 0.025

Confidence Interval for Difference in Two Proportions

\(\text{Statistic} \pm 2 \times \text{Standard Error}\)

\((\hat{p}_1-\hat{p}_2)\pm 2\times\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}\)

Confidence Interval in Babies Example

An approximate 95% confidence interval for the difference in proportion of babies who would imitate “trustworthy” vs “non-trustworthy” adults is:

\[(\hat{p}_1-\hat{p}_2)\pm 2\times\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}\]

\[ \begin{aligned} &(\hat{p}_1-\hat{p}_2)\pm 2\times\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)} \\ &=\left(\frac{18}{30}-\frac{10}{30}\right)\pm 2\times\sqrt{\frac{28}{60} \left(1-\frac{28}{60}\right) \left(\frac{1}{30}+\frac{1}{30}\right)}\\ &=0.2667\pm2\times\sqrt{0.0166}\\ &= 0.2667 \pm 2\times0.129 \end{aligned} \]
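This interval can be checked in base R using the pooled \(\hat{p}\) from the formula above. (prop.test's interval on the next slide uses an unpooled standard error and a 1.96 multiplier, so it differs slightly.)

```r
# Approximate 95% CI for the difference in proportions (pooled formula)
p1 <- 18/30; p2 <- 10/30
p_pool <- 28/60                          # combined proportion of imitators
se <- sqrt(p_pool * (1 - p_pool) * (1/30 + 1/30))
c(lower = (p1 - p2) - 2 * se, upper = (p1 - p2) + 2 * se)
```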

Babies Confidence Interval in R

prop.test(x=c(18, 10), n=c(30,30), alternative="two.sided", correct=FALSE)
## 
##  2-sample test for equality of proportions without continuity correction
## 
## data:  c(18, 10) out of c(30, 30)
## X-squared = 4.2857, df = 1, p-value = 0.03843
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.02338304 0.50995029
## sample estimates:
##    prop 1    prop 2 
## 0.6000000 0.3333333

We are 95% confident that the proportion of babies who would imitate “trustworthy” adults is between 0.02 and 0.51 higher than for “non-trustworthy” adults.

Normal Approximation Conditions

When the sample size is very small, the normal approximation is often inappropriate.

This is especially a concern when the hypothesized value of \(p\) is close to 0 or 1.

Guideline: In order to use the normal approximation there should be at least 10 “successes” and 10 “failures” in each group.

Module 3: Inference for Quantitative Data

Video 3.1 Learning Outcomes

  1. Distinguish between the distribution for a sample mean and the distribution of individual observations.
  2. Explain the impact of sample size on variability of sampling distributions, test-statistics, p-values, and evidence against the null hypothesis.
  3. Determine whether it is appropriate to use a t-distribution to approximate the sampling distribution for a mean.

Mercury Levels in Florida Lakes

A 2004 study by Lange, T., Royals, H., and Connor, L. examined mercury accumulation in largemouth bass taken from a sample of 53 Florida lakes. If mercury accumulation exceeds 0.5 ppm, then there are environmental concerns. In fact, the legal safety limit in Canada is 0.5 ppm, although it is 1 ppm in the United States.

Florida Lakes Data

## # A tibble: 6 × 2
##   Lake         AvgMercury
##   <chr>             <dbl>
## 1 Alligator          1.23
## 2 Annie              1.33
## 3 Apopka             0.04
## 4 Blue Cypress       0.44
## 5 Brick              1.2 
## 6 Bryant             0.27

Mercury Levels in Florida Lakes

The histogram shows the distribution of Mercury levels for the 53 lakes.

Mean SD
0.5271698 0.3410356

Key Question

Mean SD
0.5271698 0.3410356

Do these data provide enough evidence to say that the mean mercury concentration for all Florida Lakes is higher than 0.5?

How unusual would it be to get a sample of 53 lakes whose mean mercury concentration is 0.027 ppm higher than the hypothesized mean of 0.5 ppm?

To answer this, we need to understand the behavior of the sample mean for samples of 53 lakes.

Sampling Distribution of Mean (\(\bar{x}\))

Distribution of Hg Concentration in Individual Lakes:

Distribution of Sample Means: (sampling distribution for the mean)

As sample size increases:

  • The variability in the distribution of \(\bar{x}\) decreases

Conditions for Using t-Distribution for Means

It is reasonable to approximate the distribution of sample means with a symmetric, bell-shaped t-distribution when:

  1. the data in the sample are roughly symmetric OR
  2. the sample size is fairly large (\(n\geq 30\)) and the data are not heavily skewed

Distribution of Sample Mean for Skewed Data

Distribution of revenues for 2018 movies:

Distribution of Sample Means: (sampling distribution for the mean)

As sample size increases:

  • The variability in the distribution of \(\bar{x}\) decreases
  • The sampling distribution of \(\bar{x}\) becomes more bell-shaped

Video 3.2 Learning Outcomes

  1. Perform hypothesis tests and obtain confidence intervals for means using R.
  2. Interpret the results of hypothesis tests and confidence intervals involving quantitative data.

Key Question

Mean SD
0.5271698 0.3410356

Do these data provide enough evidence to say that the mean mercury concentration for all Florida Lakes is higher than 0.5?

How unusual would it be to get a sample of 53 lakes whose mean mercury concentration is 0.027 ppm higher than the hypothesized mean of 0.5 ppm?

Parameter, Statistic, and Hypotheses

Parameter of interest: mean mercury level for all lakes in Florida (\(\mu\))

Null Hypothesis: The mean mercury level for all lakes in Florida is 0.5 ppm. (\(\mu=0.5\))

Alternative Hypothesis: The mean mercury level for all lakes in Florida exceeds 0.5 ppm. (\(\mu>0.5\))

Sample Statistic: \(\bar{x}=0.527\)

Categorical and Quantitative Variables

                        Categorical                            Quantitative
Data (outcome variable) A category                             A number
Parameter of interest   unknown long-run proportion (\(p\))    unknown overall mean (\(\mu\))
Sample statistic        proportion from sample (\(\hat{p}\))   sample mean (\(\bar{x}\))

Standardized Statistics

standardized statistic: \(t= \frac{\bar{x}-\mu}{s\mathbin{/}\sqrt{n}}\)

where:
\(\bar{x}\) is the sample mean
\(\mu\) is the value from the null hypothesis
\(s\) is the sample standard deviation
\(n\) is the sample size

The quantity \(\frac{s}{\sqrt{n}}\) is called the standard error of \(\bar{x}\).

A confidence interval for \(\mu\) is given by:

\(\bar{x}\pm m\times \frac{s}{\sqrt{n}}\)

For approximate 95% confidence, use \(m=2\)

Florida Lakes Example

Recall the mean and standard deviation in the sample of 53 Florida Lakes.

Mean SD n
0.5271698 0.3410356 53

When testing the null hypothesis \(\mu=0.5\), the standardized statistic is:

\(t= \frac{\bar{x}-\mu}{s\mathbin{/}\sqrt{n}}=\frac{0.527-0.5}{0.341\mathbin{/}\sqrt{53}}=0.58\).
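As a check, the same computation can be carried out directly in R from the summary statistics (a sketch; the values are copied from the table above):

```r
# t-statistic and approximate 95% CI from the Florida Lakes summary statistics
xbar <- 0.5271698; s <- 0.3410356; n <- 53; mu0 <- 0.5
se <- s / sqrt(n)             # standard error of the mean
t_stat <- (xbar - mu0) / se   # standardized statistic, approximately 0.58
t_stat
xbar + c(-2, 2) * se          # approximate 95% CI, roughly (0.43, 0.62)
```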

Florida Lakes \(t\)-statistic

The \(t\)-statistic is consistent with what we would expect if the mean mercury level for all Florida lakes were 0.5 ppm. It is plausible that the mean mercury level is 0.5 ppm.

pt(q=0.58, df=52, lower.tail = FALSE)
## [1] 0.2822097

t-test in R

t.test(x=FloridaLakes$AvgMercury, mu=0.5, conf.level=0.95, alternative="greater")
## 
##  One Sample t-test
## 
## data:  FloridaLakes$AvgMercury
## t = 0.58, df = 52, p-value = 0.2822
## alternative hypothesis: true mean is greater than 0.5
## 95 percent confidence interval:
##  0.4487193       Inf
## sample estimates:
## mean of x 
## 0.5271698

The probability of observing a sample mean as large as 0.527 ppm, if the true mean mercury concentration for all Florida lakes were really 0.50 ppm, is 0.28 (pretty high).

We do not reject the null hypothesis. There is not enough evidence to conclude that the mean mercury concentration for all Florida lakes is more than 0.5 ppm.

Confidence Interval

t.test(x=FloridaLakes$AvgMercury, mu=0.5, conf.level=0.95, alternative="two.sided")
## 
##  One Sample t-test
## 
## data:  FloridaLakes$AvgMercury
## t = 0.58, df = 52, p-value = 0.5644
## alternative hypothesis: true mean is not equal to 0.5
## 95 percent confidence interval:
##  0.4331688 0.6211709
## sample estimates:
## mean of x 
## 0.5271698

We are 95% confident that the mean mercury concentration in all Florida lakes is between 0.43 and 0.62 ppm.

Northern vs Southern Florida Lakes

Is average mercury level different for lakes in Northern Florida than Southern Florida?

from Google Maps

from Google Maps

Hypotheses

Null Hypothesis: There is no difference in the mean mercury levels between lakes in Northern and Southern Florida (\(\mu_1=\mu_2\))

Alternative Hypothesis: There is a difference in the mean mercury levels between lakes in Northern and Southern Florida.(\(\mu_1\neq\mu_2\))

Statistics for North vs South Lakes

gf_boxplot(data=FloridaLakes, AvgMercury ~ Location) %>% gf_refine(coord_flip())

Location MeanHg StDevHg N
N 0.4245455 0.2696652 33
S 0.6965000 0.3838760 20

Key Question

How unusual would it be to observe a difference in sample means as large as \(0.6965 - 0.4245 = 0.2720\) if mercury concentrations are really the same in the North as in the South?

We can use a t-test since:
1. Data are roughly symmetric
2. Sample sizes are reasonably large (33 and 20)
3. Samples are independent

Two-sample t-test

A test statistic is:

\(t=\frac{\bar{x}_1-\bar{x}_2}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}}\)

A confidence interval for \(\mu_1-\mu_2\) is given by:

\((\bar{x}_1-\bar{x}_2) \pm m\times{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}}\)

For an approximate 95% confidence interval, use \(m=2\)
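The test statistic can be computed directly from the summary statistics given earlier (a sketch; values copied from the North/South table):

```r
# Two-sample t-statistic from the North vs South summary statistics
x1 <- 0.4245455; s1 <- 0.2696652; n1 <- 33   # Northern lakes
x2 <- 0.6965000; s2 <- 0.3838760; n2 <- 20   # Southern lakes
se <- sqrt(s1^2/n1 + s2^2/n2)
t_stat <- (x1 - x2) / se
t_stat   # approximately -2.78
```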

t-test in R

t.test(data=FloridaLakes, AvgMercury~Location, conf.level=0.95, alternative="two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  AvgMercury by Location
## t = -2.7797, df = 30.447, p-value = 0.009239
## alternative hypothesis: true difference in means between group N and group S is not equal to 0
## 95 percent confidence interval:
##  -0.4716369 -0.0722722
## sample estimates:
## mean in group N mean in group S 
##       0.4245455       0.6965000

Conclusions

The low p-value of 0.009239 tells us that there is strong evidence against the null hypothesis. We have reason to believe that the average mercury concentration in northern lakes is not the same as in southern lakes.

We are 95% confident that the average mercury concentration in northern lakes is between 0.07 and 0.47 parts per million less than in southern lakes.

Video 3.3 Learning Outcomes

  1. Recognize when it is appropriate to use a paired t-test, rather than a t-test for independent samples.
  2. Explain the differences in the way that paired t-tests account for variability, compared to t-tests for independent samples.

Rounding First Base

  • Woodward (1970) studied whether a “wide” or “narrow” angle when rounding first base provides an advantage for baseball players running to second base.
  • 22 runners were timed using each approach with rest in-between.
  • Woodward randomly determined which angle the player would take first.
Image from Introduction to Statistical Investigations by Tintle et al.


Hypotheses

Null Hypothesis: There is no difference between average running times using the wide and narrow angles.

Alternative Hypothesis: There is a difference between average running times using the wide and narrow angles.

How is this question similar to the comparison of mercury levels in lakes in Northern vs Southern Florida? How is it different?

Paired Data

To this point, we have assumed that we are working with independent data.

In the lakes example, we observed a different set of lakes in Northern Florida than in Southern Florida. The samples were independent.

Here, we observed the same runners twice, once using each type of angle. We expect runners who are faster using one angle to also be faster using the other angle, so the samples are not independent.

Comparison

Do you think the data provide evidence that one running strategy is better than the other? Why or why not?

Angle mean St.Dev n
narrow 5.534091 0.2597555 22
wide 5.459091 0.2728319 22

Average difference: 5.534-5.459 = 0.075 sec. 

More Information

We now match the times of each runner individually.

  • 17 of the runners ran faster using the wide angle.
  • 5 of the runners ran faster using the narrow angle.

Does this information change your thoughts on whether there is evidence that one strategy is better? Why or why not?

Tests for Paired Data

When we have multiple observations on the same subjects, we find the difference between each subject individually, and then perform a 1-sample t-test on the differences.

Runner narrow wide Difference
1 5.50 5.55 -0.05
2 5.70 5.75 -0.05
3 5.60 5.50 0.10
4 5.50 5.40 0.10
5 5.85 5.70 0.15
6 5.55 5.60 -0.05

Paired t-test in R

t.test(x=Baserunners$narrow, y=Baserunners$wide, paired = TRUE)
## 
##  Paired t-test
## 
## data:  Baserunners$narrow and Baserunners$wide
## t = 3.9837, df = 21, p-value = 0.0006754
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  0.03584814 0.11415186
## sample estimates:
## mean difference 
##           0.075

p-value and Conclusion

The p-value represents the probability of observing a mean difference as extreme as or more extreme than 0.075 if there is really no difference between average times using the narrow and wide angles.

Because the p-value is very small (0.0006754), we have strong evidence of differences in running times.

We are 95% confident that it takes between 0.036 and 0.11 seconds longer to run using the narrow approach than the wide approach.

Independent sample t-test

THIS IS NOT THE CORRECT WAY TO ANALYZE THESE DATA!!!

t.test(x=Baserunners$narrow, y=Baserunners$wide)
## 
##  Welch Two Sample t-test
## 
## data:  Baserunners$narrow and Baserunners$wide
## t = 0.93383, df = 41.899, p-value = 0.3557
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.08709334  0.23709334
## sample estimates:
## mean of x mean of y 
##  5.534091  5.459091

Summary Data

Test for paired differences:

mean St.Dev n
0.075 0.0883041 22

\(t=\frac{\bar{x}_d}{\frac{s_d}{\sqrt{n}}} = \frac{0.075}{\frac{0.0883}{\sqrt{22}}}\approx 3.98\)

This is the correct approach.
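The paired t-statistic and p-value can be reproduced in R from the summary of the differences (a sketch; values copied from the table above):

```r
# Paired t-statistic from the summary of the 22 individual differences
dbar <- 0.075; s_d <- 0.0883041; n <- 22
t_stat <- dbar / (s_d / sqrt(n))
t_stat                                          # approximately 3.98
2 * pt(t_stat, df = n - 1, lower.tail = FALSE)  # two-sided p-value, well below 0.01
```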

t-Test for Two Independent Samples:

Angle mean St.Dev n
narrow 5.534091 0.2597555 22
wide 5.459091 0.2728319 22

\(t=\frac{(\bar{x}_1-\bar{x}_2)}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}}=\frac{(5.534-5.459)}{\sqrt{\frac{0.2598^2}{22}+\frac{0.2728^2}{22}}}\approx0.93\)

THIS IS NOT THE CORRECT WAY TO ANALYZE THESE DATA!!!

The independent t-test overestimates the amount of variability.
- It treats variability between runners as noise, rather than isolating the effect of the angle

p-values

Paired t-test (Correct)

p-value:

## [1] 0.0006815105


t-Test for Two Independent Samples: (Incorrect)

p-value:

## [1] 0.3576859

Examples

Should we use procedures for paired data, or independent data?

  1. We are interested in testing whether a yoga class improves flexibility. 25 people participate in the class. Participants’ flexibility scores are recorded before and after participating in the class.

  2. We are interested in testing whether listening to music impacts concentration. A sample of 80 college students is randomly divided into two groups. Both groups read the same passage from a textbook, but one group reads it while listening to music and the other reads it in silence. Students then take a quiz to see what they remembered and quiz scores are compared.

  3. We are interested in testing whether a new regulation has had an impact on carbon emissions. We collect data on 50 different factories and record their carbon emissions the year before and the year after the regulation was passed.

  4. We are interested in assessing whether mercury contamination levels in lakes differ between the summer and fall. We visit a sample of 53 lakes and measure their mercury levels in July, and then revisit the same lakes in October and measure the mercury levels again.

  5. A school board wants to determine whether there is a difference in average ACT scores between two schools in the district. They take a simple random sample of 100 students who took the ACT from each school, and compare their scores.

Video 3.4 Learning Outcomes

  1. Identify instances of Simpson’s paradox and explain how these occur.

Power of Data

  • More data have been recorded in the last two years than in all of previous human existence (Forbes magazine)

  • Data are used to:

    • develop personalized cancer therapies
    • make businesses profitable
    • find winning strategies in sports
    • much more
  • With great power comes great responsibility

    • too often data are misused, either accidentally or nefariously
    • according to Nature, more than 50% of scientific studies cannot be reproduced, and more than 25% of studies using a p < 0.05 cutoff produce false results.
    • “There are three kinds of lies: lies, damned lies, and statistics.” - Mark Twain (and others)

Hospital Survival Rates

Example from Introduction to Statistical Investigations by Tintle et al. 

All Patients:

Survived Died
Hospital A 800 (80%) 200 (20%)
Hospital B 900 (90%) 100 (10%)

Broken Down by Health Condition

We now break down survival rates by the health of patients at the time of admission to the hospital.

Patients in good condition (non-life threatening illness or injury):

Survived Died
Hospital A 590 (98.3%) 10 (1.7%)
Hospital B 870 (96.7%) 30 (3.3%)


Patients in poor condition (serious, possibly life-threatening condition):

Survived Died
Hospital A 210 (52.5%) 190 (47.5%)
Hospital B 30 (30%) 70 (70%)

Conclusion

Although the overall survival rate is higher at Hospital B, Hospital A has a higher survival rate both for patients in good condition and for patients in poor condition.

Regardless of a patient’s condition, they have a better chance for survival at Hospital A.

Hospital B has a higher overall survival rate because most of its patients are in good health upon admission, while a high percentage of Hospital A’s patients are admitted in poor health.
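The reversal can be verified with simple arithmetic on the counts in the tables above:

```r
# Survival rates at each hospital, overall and by condition
overall_A <- 800 / 1000; overall_B <- 900 / 1000   # B looks better overall
good_A <- 590 / 600;     good_B <- 870 / 900       # A is better for good condition
poor_A <- 210 / 400;     poor_B <- 30 / 100        # A is better for poor condition
c(overall_A, overall_B)   # 0.800 0.900
c(good_A, good_B)         # 0.983 0.967
c(poor_A, poor_B)         # 0.525 0.300
```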

Simpson’s Paradox

The hospital example is an instance of Simpson’s Paradox

Simpson’s paradox occurs when an overall trend appears to “reverse” when data are broken down into subgroups or categories.

Simpson’s paradox has appeared in data involving:
1. College admissions
2. Medical data
3. Sports statistics
4. Court convictions

and many more.

Video 3.5 Learning Outcomes

  1. Draw appropriate conclusions in instances of multiple testing and explain how to use Bonferroni’s correction in these situations.

Multiple Testing

A 2008 article from Newscientist.com ran the headline “Breakfast Cereals Boost the Chances of Conceiving Boys”(https://www.newscientist.com/article/dn13754-breakfast-cereals-boost-chances-of-conceiving-boys/)

Note: In addition to statistical fallacies, this article contains problematic assumptions about gender, among other topics. The purpose of examining this article is to illuminate the real and troubling ways in which misuse and misrepresentation of data can be harmful, especially when combined with problematic assumptions in society.

The article claims that women who eat breakfast cereal before becoming pregnant are significantly more likely to conceive boys. The researchers kept track of 133 foods and, for each food, tested whether there was a difference in the proportion conceiving boys between women who ate the food and women who did not. Of all the foods, only breakfast cereal resulted in a p-value less than 0.01.

Should we conclude that women who eat cereal are more likely to conceive boys? Explain.

Multiple Testing

The breakfast cereal conclusion demonstrates a pitfall known as multiple testing error.

In fact, if we test 100 different foods, none of which has any real effect, we would still expect about 5 of them to yield a p-value less than 0.05, and about 1 to yield a p-value less than 0.01, just by chance.

In order to correct for this, researchers should use a lower cutoff value (level of significance) in order to reject the null hypothesis.

Bonferroni Correction

One way to correct for multiple testing error is the Bonferroni correction. Instead of using 0.05, use \(\frac{0.05}{\# \text{ tests}}\)

Example: In the breakfast cereal example, only reject the null hypothesis if the p-value is less than \(\frac{0.05}{133}\approx0.000376\)
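In R, the Bonferroni threshold is simple arithmetic; alternatively, `p.adjust()` scales the p-values themselves so they can still be compared against 0.05. The p-values below are hypothetical, for illustration only:

```r
# Bonferroni-adjusted significance threshold for 133 tests
0.05 / 133   # about 0.000376

# Equivalent approach: scale up hypothetical p-values for 133 tests
pvals <- c(0.008, 0.020, 0.300)
p.adjust(pvals, method = "bonferroni", n = 133)   # all capped at 1: none significant
```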

Video 3.6 Learning Outcomes

  1. Explain the difference between statistical significance and practical importance.

Statistical Significance vs Practical Importance

  • “(S)cientists have embraced and even avidly pursued meaningless differences solely because they are statistically significant, and have ignored important effects because they failed to pass the screen of statistical significance…It is a safe bet that people have suffered or died because scientists (and editors, regulators, journalists and others) have used significance tests to interpret results, and have consequently failed to identify the most beneficial courses of action.” -ASA statement on p-values, 2016

Smoking and Birthweight

We consider data on a random sample of 80 babies born in North Carolina in 2004. Thirty were born to mothers who were smokers, while fifty were born to mothers who were nonsmokers.

We are interested in studying whether there is evidence of a difference in average birthweight between babies born to smokers and nonsmokers.

Smoker Birthweight Data

habit Mean_Weight SD n
nonsmoker 7.039200 1.709388 50
smoker 6.616333 1.106418 30

Hypotheses

Let \(\mu_1\) represent mean birthweight for babies whose mothers are nonsmokers.

Let \(\mu_2\) represent mean birthweight for babies whose mothers are smokers.

Null Hypothesis: There is no difference between mean birthweight for babies with mothers who smoke, compared to babies with mothers who do not. (\(\mu_1=\mu_2\))

Alternative Hypothesis: There is a difference between mean birthweight for babies with mothers who smoke, compared to babies with mothers who do not. (\(\mu_1\neq\mu_2\))

Test and Interval

Since the samples are not too small (both \(n\geq 30\)) and not heavily skewed, the t-distribution is appropriate.

t.test(data=NCBirths, weight~habit, alternative="two.sided", conf.level=0.95)
## 
##  Welch Two Sample t-test
## 
## data:  weight by habit
## t = 1.3423, df = 77.486, p-value = 0.1834
## alternative hypothesis: true difference in means between group nonsmoker and group smoker is not equal to 0
## 95 percent confidence interval:
##  -0.2043806  1.0501140
## sample estimates:
## mean in group nonsmoker    mean in group smoker 
##                7.039200                6.616333

Conclusion

Do the data provide evidence of a difference in birthweights for babies born to smokers, compared to nonsmokers?

Context

Many studies have shown that a mother’s smoking puts a baby at risk of low birthweight. Do our results contradict this research? Why or why not?

Impact of Small Sample Size

Notice that we observed a difference of about 0.4 lbs. in mean birthweight. A difference of that size is considerable for baby weights.

It would be highly inappropriate to say that our data suggest that there is no difference in birthweights for babies of smokers, compared to nonsmokers.

Our large p-value is due to the fact that our sample size is too small. It does NOT suggest that there is no difference.

This is yet another example of why we should never “accept the null hypothesis” or say that our data “support the null hypothesis.”

Larger Dataset

In fact, this sample of 80 babies is part of a larger dataset, consisting of 1,000 babies born in NC in 2004. When we consider the full dataset, notice that the difference between the groups is similar, but the p-value is much smaller.

t.test(data=ncbirths, weight~habit, alternative="two.sided", conf.level=0.95)
## 
##  Welch Two Sample t-test
## 
## data:  weight by habit
## t = 2.359, df = 171.32, p-value = 0.01945
## alternative hypothesis: true difference in means between group nonsmoker and group smoker is not equal to 0
## 95 percent confidence interval:
##  0.05151165 0.57957328
## sample estimates:
## mean in group nonsmoker    mean in group smoker 
##                7.144273                6.828730

Flights from New York to Chicago

A traveler lives in New York and wants to fly to Chicago. They consider flying out of two New York airports:

  • Newark (EWR)
  • LaGuardia (LGA)

We have data on the times of flights from both airports to Chicago’s O’Hare airport from 2013 (more than 14,000 flights).

Assuming these flights represent a random sample of all flights from these airports to Chicago, consider how the traveler might use this information to decide which airport to fly out of.

New York to Chicago Flights

origin Mean_Airtime SD n
EWR 113.2603 9.987122 5828
LGA 115.7998 9.865270 8507

Hypotheses

Let \(\mu_1\) represent mean airtime for flights out of Newark (EWR).

Let \(\mu_2\) represent mean airtime for flights out of LaGuardia (LGA).

Null Hypothesis: Mean flight time from LaGuardia to O’Hare is the same as from Newark to O’Hare (\(\mu_1=\mu_2\))

Alternative Hypothesis: Mean flight time to O’Hare differs between the two New York airports (\(\mu_1\neq\mu_2\))

Test and Interval

Because sample sizes are large (much greater than 30), we can use the t-distribution.

t.test(data=Flights1, air_time~origin, alternative="two.sided", conf.level=0.95)
## 
##  Welch Two Sample t-test
## 
## data:  air_time by origin
## t = -15.028, df = 12419, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means between group EWR and group LGA is not equal to 0
## 95 percent confidence interval:
##  -2.870747 -2.208287
## sample estimates:
## mean in group EWR mean in group LGA 
##          113.2603          115.7998

Conclusion

Do the data provide evidence that mean flight time to O’Hare differs between the two New York airports?

How important would this information be for you when deciding which New York airport to fly out of?

Summary

The low p-value gives us strong evidence of a difference in mean flight times between the two New York airports.

We can be 95% confident that average flight time to Chicago O’Hare is between 2.2 and 2.9 minutes shorter for flights out of Newark than for flights out of LaGuardia.

In reality, this difference is practically meaningless.

The low p-value is due to the very large sample size.

Statistical Significance vs Practical Importance

  • A p-value only tells us part of the story.

  • low p-value tells us a difference would not likely have occurred by chance

  • does not tell us size of difference or whether it is meaningful

  • when sample size is large, even small differences yield small p-values

Cautions and Advice

p-values are only (a small) part of a statistical analysis.

  • For small samples, real differences might not be statistically significant.
    -Don’t accept the null hypothesis. Gather more information.
  • For large samples, even very small differences will be statistically significant.
    -Look at the confidence interval. Is the difference practically important?
  • When many hypotheses are tested at once (such as many food items), some will produce a significant result just by chance.
    -Use a multiple testing correction, such as Bonferroni.
  • Interpret p-values on a “sliding scale”
    • 0.049 is practically the same as 0.051
  • Is sample representative of larger population?
  • Were treatments randomly assigned (for experiments)?
  • Are there other variables to consider?

Module 4: Regression and Modeling

Video 4.1 Learning Outcomes

  1. Describe scatterplots including strength, direction, and form of the relationship between the variables in the plot. Also identify unusual observations.
  2. Explain how the line of best fit is obtained.

Regression

We will now consider situations where we wish to compare two quantitative variables. For example

How is the price of a car related to the amount of time it takes a car to accelerate from 0 to 60 miles per hour?

The variable we are interested in predicting is called the response variable and is plotted on the y-axis. (price)

The variable we are using to make predictions is called the explanatory variable (or predictor variable) and is plotted on the x-axis. (acceleration time)

Such problems can be investigated using linear regression models.

Price and Acceleration Time in 2015 New Cars

Describing Scatterplots

When describing relationships between quantitative variables, we should consider:

  • direction (positive or negative)
  • form (is the trend roughly linear)
  • strength (do the data follow a clear pattern or are they scattered without clear form)
  • Unusual Observations

Correlation Coefficient

The correlation coefficient measures the strength and direction of the linear relationship between two quantitative variables.

Correlation between Price and Acc. Time

cor(SmallCars$LowPrice, SmallCars$Acc060)
## [1] -0.8230937

There is a fairly strong negative linear association between acceleration time and price.

Cautions about Correlation Coefficient

A correlation between two variables does not imply there is a causal relationship.

Correlation only describes the strength of a linear relationship. \(r=0\) means there is no linear association, but there could still be a nonlinear relationship.

Example:

Line of Best Fit

Line of Best Fit (Cont.)

  • The Corr/Regression applet illustrates how the line of best fit is chosen.

  • A residual is the difference between the observed response and predicted response.

Residual = Observed - Predicted

The line of best fit is determined by minimizing the sum of the squared residuals.
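A small sketch (with made-up data, not from the course) shows that the fitted line achieves a smaller sum of squared residuals than a perturbed line:

```r
# The least-squares line minimizes the sum of squared residuals (SSR)
# (small synthetic dataset for illustration)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
M <- lm(y ~ x)
ssr_fit <- sum(resid(M)^2)
# A line with a slightly different slope has a larger SSR:
ssr_other <- sum((y - (coef(M)[1] + (coef(M)[2] + 0.1) * x))^2)
ssr_fit < ssr_other   # TRUE
```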

Video 4.2 Learning Outcomes

  1. Identify the slope, intercept, correlation, and coefficient of determination for a linear regression model using R output.
  2. Interpret slope, intercept, correlation, and coefficient of determination in the context of a real life problem.
  3. Use simple linear regression to predict the value of a response variable for a given value of the explanatory variable.
  4. Recognize the risk of extrapolation and explain when it is inappropriate to make predictions using a regression line.

Regression Equation

The regression equation has the form \(y=b_0+b_1x\).

\(b_0\) represents the y-intercept of the regression line. This is the expected value of the response variable when the explanatory variable is 0.

In this case, that would be the expected price of a car that can accelerate from 0 to 60 mph in 0 seconds.

\(b_1\) represents the slope of the regression line. This is the expected change in the response variable when the explanatory variable increases by 1 unit.

In this case, that would be expected change in price for each additional second it takes a car to accelerate from 0 to 60 mph.

Linear Model in R

M <- lm(data=SmallCars, LowPrice~Acc060)
summary(M)
## 
## Call:
## lm(formula = LowPrice ~ Acc060, data = SmallCars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.487  -5.921   0.855   3.404  30.141 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  79.3886     5.5566  14.287 < 0.0000000000000002 ***
## Acc060       -6.1535     0.6329  -9.723     0.00000000000125 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.267 on 45 degrees of freedom
## Multiple R-squared:  0.6775, Adjusted R-squared:  0.6703 
## F-statistic: 94.53 on 1 and 45 DF,  p-value: 0.000000000001246

Interpretations in Regression

The regression equation is:

\(\text{Expected Price} = 79.39 - 6.1535 \times \text{Acceleration Time}\)

Interpretation of y-intercept

The expected price of a car that can accelerate from 0 to 60 miles per hour in 0 seconds is 79.39 thousand dollars. This does not make sense in context, so the intercept does not have a meaningful interpretation in this situation.

Interpretation of slope The price of a car is expected to decrease by 6.15 thousand dollars for each additional second it takes to accelerate from 0 to 60 mph.

Coefficient of Determination The value “Multiple R-squared” = 0.6775 means that 67.75% of the variation in price is explained by our model based on acceleration time.

Making Predictions

The regression equation is:

\(\text{Expected Price} = 79.39 - 6.1535 \times \text{Acceleration Time}\)

Examples:

  1. The expected price of a small car that takes 8 seconds to accelerate from 0 to 60 mph is \(79.39-6.1535\times 8 = 30.162\) thousand dollars.

  2. The expected price of a small car that takes 10 seconds to accelerate from 0 to 60 mph is \(79.39-6.1535\times 10 = 17.855\) thousand dollars.

  3. We should not attempt to predict the price of a small car that takes 15 seconds to accelerate, since 15 lies outside the range of our data. This is called extrapolation, which is dangerous.
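The first two predictions can be checked with quick arithmetic in R (coefficients taken from the regression output above):

```r
# Predictions from the fitted regression equation
b0 <- 79.3886; b1 <- -6.1535
b0 + b1 * 8    # about 30.16 thousand dollars
b0 + b1 * 10   # about 17.85 thousand dollars
```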

Video 4.3 Learning Outcomes

  1. Write null and alternative hypotheses in the regression setting.
  2. Explain how to use simulation to test hypotheses in regression.

Height and Footlength

We have data on the height (in inches) and footlength (in cm) for a sample of 20 people.

Measurements for the first 6 people are shown.

##   footlength height
## 2         32     74
## 3         24     66
## 4         29     77
## 5         30     67
## 6         24     56
## 7         26     65

Height and Footlength Scatterplot

Slope:

## [1] 1.033259

On average, a 1 cm increase in footlength is associated with a 1.03 inch increase in height.

Simulation-Based Hypothesis Test

We need to determine whether we could have plausibly obtained a slope as extreme as 1.03 by chance, when there was really no relationship.

Null Hypothesis: There is no relationship between height and footlength (slope=0).

Alternative Hypothesis: There is a relationship between height and footlength. (slope\(\neq\) 0)

How might we simulate a dataset where there is no relationship between height and footlength?

Shuffled Data

footlength height ShuffledHeight
2 32 74 77
3 24 66 65
4 29 77 71
5 30 67 66
6 24 56 65
7 26 65 64

Randomized Height Scatterplot

Slope:

## [1] 0.01330377

Applet

This Rossman-Chance applet can be used to repeatedly simulate shuffled data.

Repeated Simulations

Proportion of simulations with slope exceeding 1.03:

## [1] 0.0008
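The shuffling simulation can be sketched in R. For a self-contained illustration, the sketch below uses only the six rows of data shown earlier (the full analysis uses all 20 people, so the numbers here will differ):

```r
# Permutation (shuffling) test for the slope, sketched on six observations
foot   <- c(32, 24, 29, 30, 24, 26)
height <- c(74, 66, 77, 67, 56, 65)
obs_slope <- coef(lm(height ~ foot))[2]
set.seed(1)
# Shuffle heights to break any real relationship, then refit the line
sim_slopes <- replicate(10000, coef(lm(sample(height) ~ foot))[2])
mean(abs(sim_slopes) >= abs(obs_slope))   # approximate two-sided p-value
```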

Conclusion

The probability of observing a slope as extreme as 1.03 if there is really no relationship between height and footlength is 0.0008.

There is strong evidence of a relationship between height and footlength.

Video 4.4 Learning Outcome

  1. Calculate and interpret test statistics and p-values for the slope of a regression line, using R output.

  2. Calculate and interpret a confidence interval for the slope of a regression line, using R output.

Theory-based Test

We saw that the simulation-based null distribution for the slope is symmetric and roughly bell-shaped, so we can approximate it using a t-distribution.

We calculate a t-statistic using the formula

\(t=\frac{\text{slope}}{\text{Standard Error(Slope)}}\)

We will get these quantities from R.
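As a sketch, these hand computations can be reproduced using the slope estimate (1.0333), its standard error (0.2406), and the 18 degrees of freedom from the fitted model:

```r
# t-statistic and 95% CI for the slope, from the printed estimates
b1 <- 1.0333; se <- 0.2406; df <- 18
b1 / se                              # approximately 4.294
b1 + c(-1, 1) * qt(0.975, df) * se   # approximately (0.528, 1.539)
```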

Linear Regression in R

M <- lm(height~footlength, data=FootHeight)
summary(M)
## 
## Call:
## lm(formula = height ~ footlength, data = FootHeight)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1003 -2.2251 -0.7833  2.1330  8.7334 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  38.3021     6.9050   5.547 2.89e-05 ***
## footlength    1.0333     0.2406   4.294 0.000437 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.613 on 18 degrees of freedom
## Multiple R-squared:  0.506,  Adjusted R-squared:  0.4786 
## F-statistic: 18.44 on 1 and 18 DF,  p-value: 0.0004367

Hypothesis Tests in R

We test whether there is evidence of a relationship between footlength and height.

Null hypothesis: There is no relationship between height and footlength. (slope=0)

Alternative hypothesis: There is a relationship between height and footlength. (slope\(\neq 0\))

t-statistic: 4.294

p-value: 0.000437

The p-value tells us the probability of observing a slope as extreme as 1.03 if there is really no relationship between height and footlength.

Since this p-value is extremely low, we have strong evidence of a relationship between height and footlength.

Confidence Interval for Slope

confint(M, "footlength")
##                2.5 %   97.5 %
## footlength 0.5277435 1.538775

We can be 95% confident that a 1 cm. increase in the length of a person’s foot is associated with an increase in height between 0.53 and 1.54 inches.

Video 4.5 Learning Outcomes

  1. Use simple linear regression to predict the value of a response variable for a given value of the explanatory variable.
  2. Interpret confidence intervals and prediction intervals for predicted values.
  3. Determine whether it is appropriate to use linear regression, given a scatterplot.

Making Predictions in R

predict(M, data.frame(footlength=32))
##        1 
## 71.36641

Making Predictions in R (cont.)

predict(M, data.frame(footlength=50))
##        1 
## 89.96508

We should be careful to only make predictions inside the range of values for the explanatory variable that are available in the dataset. Going outside this range is called extrapolation and can lead to nonsensical predictions.

Intervals Associated with Regression

For a response variable (Y) and explanatory variable (X):

A confidence interval tells us a reasonable range for the average value of Y for all observations with the given value of X.

 - Example: Estimate the average height of all people with footlength 32 cm. 

A prediction interval tells us a reasonable range for the value of Y for an individual with the given value of X.

 - Example: Estimate the height of my neighbor, who I know has a footlength of 32 cm.  
 

A prediction interval must account for both variability associated with the selection of the sample and variability between individuals. A confidence interval only needs to account for variability associated with the selection of the sample.

Confidence Interval for Average Height

predict(M, newdata=data.frame(footlength=32), level=0.95, interval="confidence")
##        fit      lwr      upr
## 1 71.36641 68.91453 73.81829

We are 95% confident that the average height for all people with footlength 32 cm. is between 68.9 and 73.8 inches.

Prediction Interval for Individual Height

predict(M, newdata=data.frame(footlength=32), level=0.95, interval="prediction")
##        fit     lwr      upr
## 1 71.36641 63.3891 79.34372

We are 95% confident an individual with footlength of 32 cm. will be between 63.4 and 79.3 in. tall.

Confidence and Prediction Interval

Cautions about Regression

It is not always appropriate to use simple linear regression (that is, regression with one explanatory variable) to model the relationship between two quantitative variables.

The following slides illustrate situations where simple linear regression is not appropriate.

Nonlinear Pattern

A nonlinear pattern indicates that a more complicated model should be used, rather than a simple linear regression model.

Funnel Shape

A “funnel shape” indicates a lack of constant variability, which will throw off confidence intervals and hypothesis tests associated with a simple linear regression model.

Outliers

An outlier can throw off the general trend in the data.

Lack of Independence

It is also inappropriate to use simple linear regression if certain observations are more highly correlated with one another than with others.

For example:

  1. People in the study who are related to one another.
  2. Some plants grown in the same greenhouse and others in different greenhouses.
  3. Some observations taken in the same time period and others at different times.

All of these situations require more complicated models that account for correlation using spatial or temporal structure.

Video 4.6 Learning Outcomes

  1. Explain the regression effect.

The Regression Effect

Exam 1 vs Exam 2 scores for intro stat students at another college

What is the relationship between scores on the two exams?

The Regression Effect

Exam 1 vs Exam 2 scores for intro stat students at another college

How many of the 6 students who scored below 70 on Exam 1 improved their scores on Exam 2?

How many of the 7 students who scored above 90 improved on Exam 2?

The Regression Effect

A low score on an exam is often the result of both poor preparation and bad luck.

A high score often results from both good preparation and good luck.

While changes in study habits and preparation likely explain some improvement in low scores, we would also expect the lowest performers to improve simply because of better luck.

Likewise, some of the highest performers may simply not be as lucky on Exam 2, so a small drop-off should not be interpreted as a weaker understanding of the exam material.

Simulating Regression Effect

This simulation shows that the lowest scorers often improve, while the highest scorers often drop off, by chance alone.

This phenomenon is called the regression effect.
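
The simulation idea can be sketched in a few lines of R (illustrative code with made-up parameters, not the course's own simulation): each student has a fixed "true ability," and each exam score is that ability plus independent luck.

```r
set.seed(1)                              # for reproducibility
n <- 1000
ability <- rnorm(n, mean = 80, sd = 5)   # each student's true ability
exam1   <- ability + rnorm(n, sd = 5)    # score = ability + luck
exam2   <- ability + rnorm(n, sd = 5)    # fresh, independent luck

low  <- exam1 < quantile(exam1, 0.10)    # bottom 10% on Exam 1
high <- exam1 > quantile(exam1, 0.90)    # top 10% on Exam 1

mean(exam2[low]  - exam1[low])           # positive: low scorers improve on average
mean(exam2[high] - exam1[high])          # negative: high scorers drop off on average
```

Because luck does not carry over between exams, the extremes on Exam 1 tend to move back toward the middle on Exam 2, even though nobody's ability changed.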

Another Example

Wins by NFL teams in 2017 and 2018

Other Examples of Regression Effect

A 1973 article by Kahneman, D. and Tversky, A., "On the Psychology of Prediction," Psychol. Rev. 80:237-251, describes an instance of the regression effect in the training of Israeli air force pilots.

Trainees were praised after performing well and criticized after performing badly. The flight instructors observed that “high praise for good execution of complex maneuvers typically results in a decrement of performance on the next try.”

Kahneman and Tversky write that:

“We normally reinforce others when their behavior is good and punish them when their behavior is bad. By regression alone, therefore, they [the trainees] are most likely to improve after being punished and most likely to deteriorate after being rewarded. Consequently, we are exposed to a lifetime schedule in which we are most often rewarded for punishing others, and punished for rewarding.”

Video 4.7 Learning Outcomes

  1. Interpret the partial slopes in a multiple regression model using R output.
  2. Predict the value of a response variable using multiple regression output in R.

SAT Scores Dataset

We’ll now look at a dataset containing education data on all 50 states. Among the variables are average SAT score, average teacher salary, and fraction of students who took the SAT.

head(SAT)
##        state expend ratio salary frac verbal math  sat
## 1    Alabama  4.405  17.2 31.144    8    491  538 1029
## 2     Alaska  8.963  17.6 47.951   47    445  489  934
## 3    Arizona  4.778  19.3 32.175   27    448  496  944
## 4   Arkansas  4.459  17.1 28.934    6    482  523 1005
## 5 California  4.992  24.0 41.078   45    417  485  902
## 6   Colorado  5.443  18.4 34.571   29    462  518  980

Teacher Salaries and SAT Scores

The plot displays average SAT score against average teacher salary for all 50 US states.

What conclusion do you draw from the plot?

Are these results surprising?

Simple Linear Regression Model

M <- lm(data=SAT, sat~salary)
summary(M)
## 
## Call:
## lm(formula = sat ~ salary, data = SAT)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -147.125  -45.354    4.073   42.193  125.279 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1158.859     57.659  20.098  < 2e-16 ***
## salary        -5.540      1.632  -3.394  0.00139 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 67.89 on 48 degrees of freedom
## Multiple R-squared:  0.1935, Adjusted R-squared:  0.1767 
## F-statistic: 11.52 on 1 and 48 DF,  p-value: 0.001391

A Closer Look

Let’s break the data down by the percentage of students who take the SAT.

Low = 0%-22%
Medium = 22%-49%
High = 49%-81%

SAT <- mutate(SAT, fracgrp = cut(frac, 
      breaks=c(0, 22, 49, 81), 
      labels=c("low", "medium", "high")))

A Closer Look

Now what conclusions do you draw from the plots?

Multiple Regression Model

M <- lm(data=SAT, sat~salary+frac)
summary(M)
## 
## Call:
## lm(formula = sat ~ salary + frac, data = SAT)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -78.313 -26.731   3.168  18.951  75.590 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 987.9005    31.8775  30.991   <2e-16 ***
## salary        2.1804     1.0291   2.119   0.0394 *  
## frac         -2.7787     0.2285 -12.163    4e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33.69 on 47 degrees of freedom
## Multiple R-squared:  0.8056, Adjusted R-squared:  0.7973 
## F-statistic: 97.36 on 2 and 47 DF,  p-value: < 2.2e-16

Regression Line and Predictions

\(\text{Expected Average SAT} =\) \(987.9 + 2.18 \times \text{Average Teacher Salary} - 2.78 \times \text{Percentage taking SAT}\)

If a state has an average teacher salary of 40 thousand dollars and 30% of students take the SAT, the expected average SAT score would be

\(987.9 + 2.18 \times 40 -2.78 \times 30 = 991.7\)

predict(M, newdata=data.frame(salary=40, frac=30))
##        1 
## 991.7554

Interpretations

On average, a $1,000 increase in average teacher salary is associated with a 2.18 point increase in average SAT score, assuming no change in the percentage of students taking the SAT.

On average, a 1 percentage point increase in the percentage of students taking the SAT is associated with a 2.78 point decrease in average SAT score, assuming average teacher salary is held constant.


Module 5: Comparing more than two groups

Video 5.1 Learning Outcomes

  1. State the null and alternative hypotheses associated with a \(\chi^2\) test involving categorical data.
  2. Interpret the p-value and draw conclusions from \(\chi^2\) tests involving categorical data.

Data with More than 2 Categories

A 1992 study by Chase and Dummer asked a random sample of students in grades 4-6 in the state of Michigan which they thought was most important: grades, being popular, or playing sports.

Questions of interest:

  • Do priorities differ between 4th, 5th, and 6th grade students?
  • Do priorities differ between students in rural, suburban, and urban schools?

Grade Level Hypotheses

Null Hypothesis: There are no differences in preferences for all 4th, 5th, and 6th graders.

Alternative Hypothesis: There are differences in preferences between grade levels.

Breakdown by Grade Level

Counts:

          4    5    6
Grades   49   50   69
Popular  24   36   38
Sports   19   22   28

Percentages (within each grade):

          4    5    6
Grades   53   46   51
Popular  26   33   28
Sports   21   20   21
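
The percentage table can be reproduced from the counts with prop.table (a sketch; the object name counts is ours, not from the course code):

```r
# Grade-level counts from the table above
counts <- matrix(c(49, 50, 69,
                   24, 36, 38,
                   19, 22, 28),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("Grades", "Popular", "Sports"),
                                 c("4", "5", "6")))

# Column percentages: within each grade, the percent choosing each priority
round(100 * prop.table(counts, margin = 2))
```

This reproduces the percentage table above (53/46/51 for grades, and so on).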


\(\chi^2\) Test

  • A \(\chi^2\) statistic tells us how “different” proportions are across groups. The larger the \(\chi^2\) value, the stronger the evidence of differences.

  • The \(\chi^2\) distribution is appropriate when each cell in the table has an expected count of at least 5.

Chi-Square Test for Grade Level

chisq.test(T)
## 
##  Pearson's Chi-squared test
## 
## data:  T
## X-squared = 1.5126, df = 4, p-value = 0.8244
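
To check this output by hand, the counts from the earlier slide can be entered directly as a matrix (here we assume the course's object T holds this same table), which also lets us verify the expected-count condition:

```r
T <- matrix(c(49, 50, 69,
              24, 36, 38,
              19, 22, 28),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("Grades", "Popular", "Sports"),
                            c("4", "5", "6")))
res <- chisq.test(T)
res$statistic   # X-squared = 1.5126, matching the output above
res$expected    # all expected counts are well above 5
```

(As a side note, `T` is also R's shorthand for `TRUE`, so a more descriptive object name is usually safer in practice.)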

Grade Level Conclusion

It would not be at all unusual to observe differences between grade levels as extreme as we saw in the table if there were really no differences at all. There is not evidence of differences between grade levels.

Location Hypotheses

Null Hypothesis: There are no differences in preferences between students in rural, suburban, and urban settings.

Alternative Hypothesis: There are differences in preferences between settings.

Breakdown by Location

Counts:

         Rural  Suburban  Urban
Grades      57        87     24
Popular     50        42      6
Sports      42        22      5

Percentages (within each location):

         Rural  Suburban  Urban
Grades      38        58     69
Popular     34        28     17
Sports      28        15     14


Chi-Square Test for Location

chisq.test(T2)
## 
##  Pearson's Chi-squared test
## 
## data:  T2
## X-squared = 18.564, df = 4, p-value = 0.000957

Location Conclusion

It would be very surprising to observe differences between rural, suburban, and urban settings as extreme as we saw in the table if there were really no differences at all. There is strong evidence of differences between settings.

Video 5.2 Learning Outcomes

  1. State the null and alternative hypotheses associated with an ANOVA F-test.
  2. Interpret the p-value and draw conclusions from ANOVA F-tests.
  3. Explain how F-statistics measure variability within and between groups.

Quantitative Data with Multiple Groups

Researchers are interested in studying how being exposed to light at night impacts weight gain in mice. In a study, 27 mice were assigned to one of three light conditions:

  • LD (standard light/dark cycle)
  • LL (bright light at all times)
  • DM (dim light at night)

After 3 weeks, researchers recorded the body mass gain (in grams) in the mice.

Hypotheses

Null Hypothesis: The mean weight gain is the same for each light/dark setting, across the population of all mice.

Alternative Hypothesis: The mean weight gain is different for at least one of the light/dark settings.

Conditions for ANOVA F-test

These hypotheses can be tested using a procedure called ANalysis Of VAriance (ANOVA). The F-test requires the following conditions:

  1. Observations are independent.
  2. Data within each group are approximately normally distributed.
  3. Variability is roughly equal between groups.

Mice Weight Gain Results

Light  mean_Gain        sd   n
DM       7.85900  3.009291  10
LD       5.92625  1.899420   8
LL      11.01000  2.623985   9

ANOVA F-Test

An F-test compares the amount of variability between groups to the amount of variability within groups.

                          Scenario 1                     Scenario 2
Variation between groups  High                           Low
Variation within groups   Low                            High
F statistic               Large                          Small
Result                    Evidence of group differences  No evidence of differences

ANOVA F-test in R

A <- aov(data=Mice, BMGain~Light)
summary(A)
##             Df Sum Sq Mean Sq F value  Pr(>F)   
## Light        2  113.1   56.54   8.385 0.00173 **
## Residuals   24  161.8    6.74                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
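
The F value in this table is just the ratio of the two mean squares (Mean Sq = Sum Sq / Df), which we can check by hand from the rounded values printed above:

```r
MS_light    <- 113.1 / 2     # mean square between groups
MS_residual <- 161.8 / 24    # mean square within groups
F_stat <- MS_light / MS_residual
F_stat                                             # approximately 8.39
pf(F_stat, df1 = 2, df2 = 24, lower.tail = FALSE)  # p-value, approximately 0.0017
```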

Conclusion

We have evidence that there are differences in mean body mass gain in mice between the different light/dark settings.

We don't yet know which light/dark settings differ significantly from one another.

Pairwise t-tests

We use t-tests to compare each pair of the three groups.

pairwise.t.test(Mice$BMGain, Mice$Light, p.adj="none")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  Mice$BMGain and Mice$Light 
## 
##    DM      LD     
## LD 0.12972 -      
## LL 0.01431 0.00049
## 
## P value adjustment method: none

Bonferroni correction:
Since we are performing 3 comparisons, we should only conclude that two groups differ if the p-value is less than 0.05/3 = 0.0167 (for an overall 0.05 cutoff).

  • There is evidence that weight gain is different in the LL condition compared to each of the other two conditions.
  • There is not evidence of differences in weight gain between the DM and LD conditions.
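
Equivalently, we can Bonferroni-adjust the unadjusted p-values from the output above with p.adjust, which multiplies each p-value by the number of comparisons (capped at 1) so they can be compared directly to the overall 0.05 cutoff:

```r
p <- c(DM_vs_LD = 0.12972, DM_vs_LL = 0.01431, LD_vs_LL = 0.00049)
p_adj <- p.adjust(p, method = "bonferroni")
p_adj          # 0.38916, 0.04293, 0.00147
p_adj < 0.05   # only the two comparisons involving LL are significant
```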
